{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Regression "
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Regression is about predicting a numeric output for a given sample. A labeled regression sample is made up of a bunch of features and a number. The number is usually continuous, but it may also be discrete. We'll use the Trump approval rating dataset as an example."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-12-04T17:47:20.479638Z",
     "iopub.status.busy": "2023-12-04T17:47:20.479153Z",
     "iopub.status.idle": "2023-12-04T17:47:20.897979Z",
     "shell.execute_reply": "2023-12-04T17:47:20.897633Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Donald Trump approval ratings.\n",
       "\n",
       "This dataset was obtained by reshaping the data used by FiveThirtyEight for analyzing Donald\n",
       "Trump's approval ratings. It contains 5 features, which are approval ratings collected by\n",
       "5 polling agencies. The target is the approval rating from FiveThirtyEight's model. The goal of\n",
       "this task is to see if we can reproduce FiveThirtyEight's model.\n",
       "\n",
       "    Name  TrumpApproval                                                           \n",
       "    Task  Regression                                                              \n",
       " Samples  1,001                                                                   \n",
       "Features  6                                                                       \n",
       "  Sparse  False                                                                   \n",
       "    Path  /Users/max/projects/online-ml/river/river/datasets/trump_approval.csv.gz"
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from river import datasets\n",
    "\n",
    "dataset = datasets.TrumpApproval()\n",
    "dataset"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This dataset is a streaming dataset which can be looped over."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-12-04T17:47:20.899687Z",
     "iopub.status.busy": "2023-12-04T17:47:20.899562Z",
     "iopub.status.idle": "2023-12-04T17:47:20.910834Z",
     "shell.execute_reply": "2023-12-04T17:47:20.910553Z"
    }
   },
   "outputs": [],
   "source": [
    "for x, y in dataset:\n",
    "    pass"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's take a look at the first sample."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-12-04T17:47:20.912303Z",
     "iopub.status.busy": "2023-12-04T17:47:20.912228Z",
     "iopub.status.idle": "2023-12-04T17:47:20.921120Z",
     "shell.execute_reply": "2023-12-04T17:47:20.920869Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'ordinal_date': 736389,\n",
       " 'gallup': 43.843213,\n",
       " 'ipsos': 46.19925042857143,\n",
       " 'morning_consult': 48.318749,\n",
       " 'rasmussen': 44.104692,\n",
       " 'you_gov': 43.636914000000004}"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "x, y = next(iter(dataset))\n",
    "x"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A regression model's goal is to learn to predict a numeric target `y` from a bunch of features `x`. We'll attempt to do this with a nearest neighbors model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-12-04T17:47:20.922589Z",
     "iopub.status.busy": "2023-12-04T17:47:20.922495Z",
     "iopub.status.idle": "2023-12-04T17:47:20.934349Z",
     "shell.execute_reply": "2023-12-04T17:47:20.933989Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.0"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from river import neighbors\n",
    "\n",
    "model = neighbors.KNNRegressor()\n",
    "model.predict_one(x)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The model hasn't been trained on any data, and therefore outputs a default value of 0.\n",
    "\n",
    "The model can be trained on the sample, which will update the model's state."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-12-04T17:47:20.935928Z",
     "iopub.status.busy": "2023-12-04T17:47:20.935816Z",
     "iopub.status.idle": "2023-12-04T17:47:20.944466Z",
     "shell.execute_reply": "2023-12-04T17:47:20.944074Z"
    }
   },
   "outputs": [],
   "source": [
    "model.learn_one(x, y)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If we try to make a prediction on the same sample, we can see that the output is different, because the model has learned something."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-12-04T17:47:20.945980Z",
     "iopub.status.busy": "2023-12-04T17:47:20.945898Z",
     "iopub.status.idle": "2023-12-04T17:47:20.954845Z",
     "shell.execute_reply": "2023-12-04T17:47:20.954612Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "43.75505"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "model.predict_one(x)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Typically, an online model makes a prediction, and then learns once the ground truth reveals itself. The prediction and the ground truth can be compared to measure the model's correctness. If you have a dataset available, you can loop over it, make a prediction, update the model, and compare the model's output with the ground truth. This is called progressive validation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-12-04T17:47:20.956655Z",
     "iopub.status.busy": "2023-12-04T17:47:20.956493Z",
     "iopub.status.idle": "2023-12-04T17:47:22.658069Z",
     "shell.execute_reply": "2023-12-04T17:47:22.657771Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "MAE: 0.310353"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from river import metrics\n",
    "\n",
    "model = neighbors.KNNRegressor()\n",
    "\n",
    "metric = metrics.MAE()\n",
    "\n",
    "for x, y in dataset:\n",
    "    y_pred = model.predict_one(x)\n",
    "    model.learn_one(x, y)\n",
    "    metric.update(y, y_pred)\n",
    "    \n",
    "metric"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This is a common way to evaluate an online model. In fact, there is a dedicated `evaluate.progressive_val_score` function that does this for you."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-12-04T17:47:22.659592Z",
     "iopub.status.busy": "2023-12-04T17:47:22.659489Z",
     "iopub.status.idle": "2023-12-04T17:47:24.299285Z",
     "shell.execute_reply": "2023-12-04T17:47:24.299054Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "MAE: 0.310353"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from river import evaluate\n",
    "\n",
    "model = neighbors.KNNRegressor()\n",
    "metric = metrics.MAE()\n",
    "\n",
    "evaluate.progressive_val_score(dataset, model, metric)"
   ]
  }
 ],
 "metadata": {
  "interpreter": {
   "hash": "e6e87bad9c8c768904c061eafcb4f6739260ff8bb57f302c215ab258ded773dc"
  },
  "kernelspec": {
   "display_name": "Python 3.9.12 ('river')",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
